Statistical Inference & Measures of Effect

Adam La Caze
School of Pharmacy
The University of Queensland

May 2023

Introduction

Objectives

  1. Understand the logic of statistical testing in clinical trials
  2. Demonstrate an understanding of key statistical concepts: hypothesis testing, \(p\) values, confidence intervals and power
  3. Be able to describe and interpret common measures of effect used in clinical epidemiology
  4. Be able to critically appraise a clinical trial

Clinical epidemiology

Bias

refers to any systematic error in estimating the effect of a drug, exposure or risk factor on a specified outcome

Random error

variation that occurs within stochastic processes

Sources of error

A map of different types of bias that lead to inaccurate estimation

Critical Appraisal

The task of critical appraisal is to appraise the paper and decide whether the findings are of relevance to your practice.

The simplest advice about critical appraisal is to read the paper and think about it.

Critical Appraisal Tool (Oxford Centre for Evidence-based Medicine)

  1. Determine the PICO for the study
  2. Appraise the study methods
  3. Interpret the results of the study
  4. Consider external validity

Determining the PICO for a study

PICO | Notes
Participants | Who are the participants in the study? What were the inclusion/exclusion criteria? Who participated?
Intervention | What is the intervention under investigation?
Comparator | What standard treatment is the intervention being tested against? Might be placebo or active control.
Outcome | What is the primary endpoint/outcome of the study?

Appraising Voysey et al. (2021): PICO and primary results

Task

Focusing on the primary endpoint of Voysey et al. (2021):

  1. Identify the PICO
  2. Identify the null hypothesis
  3. Interpret the statistical result

Based on the information available to you, what inference does the study support?

1. Determine the PICO

PICO | Notes
Participants | Participants in one of four vaccine RCTs. Most participants were adults in professions at risk of Covid-19.
Intervention | Two doses of AZ-Oxford vaccine
Comparator | Control: meningococcal vaccine or saline
Outcome | Virologically confirmed, symptomatic Covid-19 (NAAT-positive swab, plus fever, cough, shortness of breath, or anosmia or ageusia)

Primary endpoint and analysis plan

The primary outcome was virologically confirmed, symptomatic COVID-19, defined as a NAAT-positive swab combined with at least one qualifying symptom (fever \(\ge\) 37.8°C, cough, shortness of breath, or anosmia or ageusia) (103)

… each study had to meet prespecified criteria of having at least five cases eligible for inclusion in the primary outcome before a study was included in efficacy analyses (103)

Vaccine efficacy was calculated as 1 – adjusted relative risk (ChAdOx1 nCoV-19 vs control groups) computed using a Poisson regression model with robust variance (103)

Outcome

There were 30 (0.5%) cases among 5807 participants in the vaccine arm and 101 (1.7%) cases among 5829 participants in the control group, resulting in vaccine efficacy of 70.4% (95.8% CI 54.8–80.6; table 2; figure).
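As a rough check on the reported result, a crude (unadjusted) efficacy and an approximate 95% confidence interval can be computed directly from the case counts quoted above. This is only a sketch: it uses the standard log relative-risk approximation rather than the Poisson regression with robust variance described in the analysis plan, it uses a 95% rather than a 95.8% interval, and the function name `crude_vaccine_efficacy` is an assumption made for illustration. The result should therefore be close to, but not identical to, the published figures.

```python
from math import exp, log, sqrt

def crude_vaccine_efficacy(cases_vax, n_vax, cases_ctl, n_ctl, z=1.96):
    """Return (efficacy, lower limit, upper limit) from raw counts.

    Efficacy is 1 - relative risk. The interval uses the usual normal
    approximation on the log relative-risk scale (an assumption here;
    the paper used an adjusted Poisson regression model instead).
    """
    rr = (cases_vax / n_vax) / (cases_ctl / n_ctl)
    se_log_rr = sqrt(1 / cases_vax - 1 / n_vax + 1 / cases_ctl - 1 / n_ctl)
    rr_lower = exp(log(rr) - z * se_log_rr)
    rr_upper = exp(log(rr) + z * se_log_rr)
    # The upper limit for RR gives the lower limit for efficacy, and vice versa.
    return 1 - rr, 1 - rr_upper, 1 - rr_lower

# Counts quoted from the primary result: 30/5807 (vaccine) vs 101/5829 (control)
ve, ve_lower, ve_upper = crude_vaccine_efficacy(30, 5807, 101, 5829)
print(f"Crude efficacy: {ve:.1%} (approx. 95% CI {ve_lower:.1%} to {ve_upper:.1%})")
```

The point of the exercise is that the lower limit of the interval, not just the point estimate, is what gets compared against any prespecified threshold for a clinically important effect.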

How would you interpret this result?

Statistical inference

Random error

What are your expectations when you toss a fair coin 10 times?

##  [1] 0 1 1 1 1 0 0 1 0 0
##  [1] 1 0 0 0 0 1 0 0 1 0
##  [1] 0 1 1 1 0 0 0 0 0 0

In this series of experiments we saw 5, 3 and 3 \(H\).

##  [1] 5 8 5 5 7 5 7 4 7 4

  • Some 10 coin toss experiments provided 0 \(H\) or 10 \(H\)
  • Most provided an outcome close to 5 \(H\)
  • The mean number of \(H\) across all of these experiments was 4.998

Take home

  1. Random error can make it hard to make inferences from small samples of data.
  2. Increasing the number of repetitions provided mean results much closer to what we expect given the coin is fair.
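The coin toss experiment can be reproduced with a short simulation. The following is a minimal Python sketch; the function names, the fixed seed and the choice of 10,000 repetitions are assumptions for illustration, so the numbers it prints will not match the output shown above.

```python
import random

def count_heads(n_tosses, rng, p_heads=0.5):
    """Toss a coin n_tosses times and return the number of heads."""
    return sum(1 for _ in range(n_tosses) if rng.random() < p_heads)

def repeat_experiment(n_repetitions, n_tosses=10, p_heads=0.5, seed=1):
    """Run the n_tosses experiment n_repetitions times; return the head counts."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [count_heads(n_tosses, rng, p_heads) for _ in range(n_repetitions)]

few = repeat_experiment(3)        # a handful of experiments: noisy
many = repeat_experiment(10_000)  # many experiments: the mean settles near 5

print("Heads in 3 experiments:", few)
print("Mean heads over 10,000 experiments:", sum(many) / len(many))
print("Fewest and most heads seen:", min(many), max(many))
```

Changing `p_heads` to a value other than 0.5 gives a biased coin, which is a convenient way to explore how far the mean must move before the difference from a fair coin becomes obvious.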

Task

Imagine you have a coin of unknown bias (i.e. the probability of \(H\) is unknown: it is unknown whether the coin is fair, favours \(T\) or favours \(H\)). What test could you conduct to assess whether the coin is fair?

Attempt to describe the hypothesis you are testing and the statistical model you are using for the test.

Logic of statistical testing

  1. Consider what outcomes we would expect if the coin was fair. This is called the null hypothesis.

If the coin was fair and we conducted many repetitions of the 10 coin toss experiment, we would expect results like those seen in the experiment conducted above: a wide range of results (0 \(H\) to 10 \(H\)), but with many more results clustering around 5 \(H\); we also expect a mean result of approximately 5. This expectation, and the assumptions that go along with it, provide a statistical model that we can use to inform our inferences.

  2. Conduct repetitions of the 10 coin toss experiment with the coin of unknown bias.
  3. Use what we know about the statistical model (i.e. the expected distribution of results if the coin was fair), together with a judgment about how far from fair the coin must be for the difference to matter, to determine how many repetitions of the 10 coin toss experiment we need for our results to be reliable.

For example, we might not be too worried about a coin that was biased in favour of \(H\) such that the probability of \(H\) was 0.50002; if we were, we would need a very large number of repetitions. This is a consideration of study power.

  4. Compare the mean observed result of the repetitions of the 10 coin toss experiment with the coin of unknown bias with what we would expect if the coin was fair.
  5. If the observed mean of the repetitions of the 10 coin toss experiment is considerably larger or smaller than 5, then the study provides some evidence that the coin is not fair.
  6. The probability of observing a result at least as extreme as the one we did can be calculated using the statistical model assuming the null hypothesis. This is the \(p\) value.
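For a single 10 coin toss experiment the \(p\) value can be calculated exactly from the binomial distribution, which is the statistical model for a fair coin. The sketch below is illustrative only: the function names are assumptions, and "extreme" is taken to mean "at least as far from the expected number of heads as the observed result", which is one common two-sided convention.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n tosses when P(H) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(observed_heads, n_tosses, p_null=0.5):
    """Probability, under the null, of a head count at least as far
    from the expected count as the one observed."""
    expected = n_tosses * p_null
    observed_distance = abs(observed_heads - expected)
    return sum(
        binomial_pmf(k, n_tosses, p_null)
        for k in range(n_tosses + 1)
        if abs(k - expected) >= observed_distance
    )

p_9_of_10 = two_sided_p_value(9, 10)  # 9 heads: counts 0, 1, 9 and 10 are as extreme
p_6_of_10 = two_sided_p_value(6, 10)  # 6 heads: every count except 5 is as extreme

print(f"9 heads in 10 tosses: p = {p_9_of_10:.4f}")
print(f"6 heads in 10 tosses: p = {p_6_of_10:.4f}")
```

Nine heads would be unusual for a fair coin (22 of the 1024 equally likely sequences are that extreme), whereas six heads is entirely unremarkable.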

Coin of unknown bias

The mean number of \(H\) for the fair coin was 4.998 and for the coin of unknown bias was 6.9982.

Hypothesis testing

Underpowered tests

Overpowered tests

Interpreting results of tests with different power

Power of test | Statistically significant result | Non-statistically significant result
Adequately powered test | Reject the null. Accept the alternative hypothesis. | The test failed to reject the null. Either the null is true or the effect size is smaller than was tested.
Underpowered test | Provisionally accept the alternative hypothesis. | Underdetermined result. The test is unable to detect effect sizes that might be important.
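Power can be made concrete with the coin example: it is the probability that the test rejects the null when the coin really is biased by a given amount. The sketch below computes this exactly for a two-sided test of "the coin is fair" at the 0.05 level. The true probability of heads (0.7), the sample sizes and the function names are all assumptions chosen for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n tosses when P(H) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def fair_coin_p_value(k, n):
    """Two-sided exact p value for k heads in n tosses of a fair coin."""
    expected = n * 0.5
    distance = abs(k - expected)
    return sum(binomial_pmf(j, n, 0.5) for j in range(n + 1)
               if abs(j - expected) >= distance)

def power_of_fair_coin_test(n_tosses, p_true, alpha=0.05):
    """Probability that the test rejects 'fair coin' when P(H) = p_true."""
    # Head counts that would lead to a statistically significant result
    rejection_region = [k for k in range(n_tosses + 1)
                        if fair_coin_p_value(k, n_tosses) <= alpha]
    return sum(binomial_pmf(k, n_tosses, p_true) for k in rejection_region)

power_small = power_of_fair_coin_test(10, 0.7)   # underpowered
power_large = power_of_fair_coin_test(100, 0.7)  # adequately powered

print(f"Power with 10 tosses:  {power_small:.2f}")
print(f"Power with 100 tosses: {power_large:.2f}")
```

With only 10 tosses the test will usually miss a coin that lands heads 70% of the time, so a non-significant result says very little; with 100 tosses the same bias is detected almost every time.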

Confidence intervals

95% Confidence interval

Ways to think about a confidence interval…

Precise, but confusing

  • If the study was repeated many times, and the same procedure was used to calculate the 95% confidence interval, in the long run, you would expect the calculated 95% confidence intervals would include the true value of the parameter 95% of the time

  • A 95% confidence interval provides the range of values that are not statistically different from the observed point estimate at the 0.05 level
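The repeated-sampling interpretation can be checked by simulation: run many imaginary studies with a known true value, calculate a 95% confidence interval in each, and count how often the interval includes the true value. This is a sketch under stated assumptions: the true proportion (0.3), the study size (500), the number of simulated studies (2000) and the use of the simple normal-approximation (Wald) interval are all illustrative choices.

```python
import random
from math import sqrt

def wald_interval(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p_hat = successes / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def coverage(true_p=0.3, n=500, n_studies=2000, seed=1):
    """Fraction of simulated studies whose interval contains true_p."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    covered = 0
    for _ in range(n_studies):
        successes = sum(1 for _ in range(n) if rng.random() < true_p)
        lower, upper = wald_interval(successes, n)
        if lower <= true_p <= upper:
            covered += 1
    return covered / n_studies

observed_coverage = coverage()
print(f"Intervals containing the true value: {observed_coverage:.1%}")
```

The 95% describes the long-run behaviour of the procedure across studies. Any single interval either contains the true value or it does not.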

Less precise, but useful

  • The confidence interval provides a range of plausible values for the unknown parameter

  • The lower limit is a likely lower bound estimate of the parameter; the upper limit a likely upper bound

Incorrect and misleading

  • You can be 95% confident that the true value lies within the observed confidence interval

  • The 95% confidence interval has a 95% chance of including the true effect size

Back to Voysey 2021

Understanding key aspects of the statistical inference

  1. Consider what outcomes we would expect if the vaccine didn’t work and we repeated these studies many times. This is called the null hypothesis and informs the statistical model.
  2. Use what we know about the statistical model assuming the null hypothesis is true to determine how large the studies would need to be to reliably identify a clinically important effect (lower bound of CI \(>\) 20%). This is a consideration of study power.
  3. Conduct the experiment. Compare the observed efficacy with what we would expect if the null hypothesis was true.
  4. If the observed efficacy is considerably larger than 20%, then the study provides evidence that the null hypothesis is false.
  5. The probability of observing a result at least as extreme as the one we did can be calculated using the statistical model assuming the null hypothesis is true. This is the \(p\) value.
  6. Alternatively, we can use the 95% confidence intervals around the observed efficacy to determine whether the results are statistically (and clinically) significant.
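To illustrate the \(p\) value step, the sketch below asks how surprising the observed split of cases would be if the vaccine had no effect at all. If the vaccine did nothing, each of the 131 cases would be expected to fall in the vaccine arm with probability equal to that arm's share of participants, so the number of vaccine-arm cases follows a binomial distribution. This conditional test is a simplification chosen for illustration, not the analysis used in the paper: it tests a null of zero efficacy, and it ignores follow-up time and the adjustments in the published model. The function name is an assumption.

```python
from math import comb

def one_sided_conditional_p(cases_vax, cases_ctl, n_vax, n_ctl):
    """P(vaccine-arm cases <= observed | no vaccine effect),
    conditioning on the total number of cases."""
    total_cases = cases_vax + cases_ctl
    share_null = n_vax / (n_vax + n_ctl)  # expected share of cases under the null
    return sum(
        comb(total_cases, k) * share_null**k * (1 - share_null)**(total_cases - k)
        for k in range(cases_vax + 1)
    )

# Counts quoted from the primary result: 30/5807 (vaccine) vs 101/5829 (control)
p_value = one_sided_conditional_p(30, 101, 5807, 5829)
print(f"One-sided p value under 'no effect': {p_value:.2e}")
```

Under the no-effect null roughly 65 of the 131 cases would be expected in the vaccine arm, so observing only 30 is very hard to attribute to random error.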

Appraising Voysey et al. (2021) using the Critical Appraisal Tool

2. Appraise the study methods

  • Was the assignment of patients to treatments randomized?

Yes

  • Were the groups similar at the start of the trial?

Yes—at least on a per-trial level. The participants in each trial differ quite a bit (see Table 1).

  • Aside from the allocated treatment, were groups treated equally?

Lots of differences (e.g. timing of second dose, dosing, follow-up). That said: methods pre-approved with regulators.

  • Were all patients who entered the trial accounted for? And were they analysed in the groups to which they were randomized?

See the appendix for the CONSORT participant flow diagram. Participants were analysed according to the vaccine they received (an intention-to-treat analysis was conducted as a sensitivity analysis).

  • Were the measures objective or were the patients and clinicians kept “blind” to which treatment was being received?

Measures seem appropriately objective. The studies differed in terms of masking (single, double blind).

3. Interpret the results

  • How large was the treatment effect?

Efficacy: 70.4% (95.8% CI 54.8–80.6)

  • How precise was the estimate of the treatment effect?

This is a judgment call. Given this is an interim result and the lower bound of the confidence interval is considerably greater than 20%, it seems reasonable to suggest the result is sufficiently precise.

4. External validity

  • Will the results help me in caring for my patient?
  • Alternatively: How will I apply the results?

Take homes

  • The most commonly used cut-off for \(p\) values in clinical research is 0.05.
  • If the \(p\) value is \(< 0.05\), the result is considered statistically significant.
  • A statistically significant result for the primary endpoint of a trial is more trustworthy than statistically significant results on secondary endpoints or subgroups, because the trial was set up to test the primary endpoint.
  • Once you have determined that the primary endpoint of a trial was statistically significant, the next question is whether the magnitude of the effect is clinically significant.

Measures of effect

  1. Be able to describe and interpret common measures of effect used in clinical epidemiology

Measuring the effects of interventions

Relative risk

\[RR = \frac{I_t}{I_c}\]

Relative risk reduction (or increase)

\[ RRR = 1 - RR \]

The absolute risk difference (absolute risk reduction or absolute risk increase)

\[ARR = I_c - I_t \]

\[ARI = I_t - I_c\]

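The number needed to treat (\(NNT\)) and/or harm (\(NNH\))

\[ NNT = 100/ARR \]

\[ NNH = 100/ARI \]

Here \(ARR\) and \(ARI\) are expressed as percentages; if they are expressed as proportions, the numerator is 1.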

Voysey et al. (2021)

Group | Endpoint | No endpoint | Total
Treatment | 30 | 5777 | 5807
Control | 101 | 5728 | 5829

\[ I_t = 30/5807 = 0.005 = 0.5\%\]

\[ I_c = 101/5829 = 0.017 = 1.7\% \]

\[RR = \frac{0.5}{1.7} = 0.294\]

\[ RRR = 0.706\]

\[ARR = 1.7 - 0.5 = 1.2\%\]

\[NNT = \frac{100}{1.2} \approx 83.3\]
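The same measures can be calculated without intermediate rounding. The sketch below uses the counts from the paper's primary result (30 cases among 5807 vaccine recipients and 101 among 5829 controls); the function name and the dictionary layout are assumptions for illustration. Risks are kept as proportions, so \(NNT = 1/ARR\), which is equivalent to \(100/ARR\) with \(ARR\) as a percentage.

```python
from math import ceil

def measures_of_effect(events_t, n_t, events_c, n_c):
    """Crude measures of effect from a 2x2 table.

    Assumes the event is less common in the treatment group, so that
    ARR is positive and NNT is defined.
    """
    i_t = events_t / n_t  # incidence in the treatment group
    i_c = events_c / n_c  # incidence in the control group
    rr = i_t / i_c
    arr = i_c - i_t
    return {"I_t": i_t, "I_c": i_c, "RR": rr,
            "RRR": 1 - rr, "ARR": arr, "NNT": 1 / arr}

result = measures_of_effect(30, 5807, 101, 5829)
nnt_rounded_up = ceil(result["NNT"])  # NNT is conventionally rounded up

print(f"RR  = {result['RR']:.3f}")
print(f"RRR = {result['RRR']:.1%}")
print(f"ARR = {result['ARR']:.2%}")
print(f"NNT = {result['NNT']:.1f} (rounded up to {nnt_rounded_up})")
```

Carrying full precision shifts the figures slightly from the hand calculation, which divides incidences already rounded to one decimal place. These are crude values, so the relative risk reduction is close to, but not the same as, the adjusted efficacy reported in the paper.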

Review objectives

  1. Understand the logic of statistical testing in clinical trials
  2. Demonstrate an understanding of key statistical concepts: hypothesis testing, \(p\) values, confidence intervals and power
  3. Be able to describe and interpret common measures of effect used in clinical epidemiology
  4. Be able to critically appraise a clinical trial

References

Centre for Evidence-Based Medicine. (n.d.). Critical Appraisal tools - CEBM. Retrieved March 5, 2018, from https://www.cebm.net/2014/06/critical-appraisal/
Greenhalgh, T. (1997). How to read a paper: Assessing the methodological quality of published papers. BMJ, 315(7103), 305–308. https://doi.org/10.1136/bmj.315.7103.305
Voysey, M., Clemens, S. A. C., Madhi, S. A., Weckx, L. Y., Folegatti, P. M., Aley, P. K., Angus, B., Baillie, V. L., Barnabas, S. L., Bhorat, Q. E., et al. (2021). Safety and efficacy of the ChAdOx1 nCoV-19 vaccine (AZD1222) against SARS-CoV-2: an interim analysis of four randomised controlled trials in Brazil, South Africa, and the UK. The Lancet, 397(10269), 99–111.